ML Course Part 1 - Introduction to Machine Learning

Author

Alexandre Bry

1 Definitions

1.0.1 Machine Learning (ML)

Machine Learning (ML)

Feeding data into a computer algorithm in order to learn patterns and make predictions in new and different situations.

ML Model

Computer object implementing a ML algorithm, trained on a set of data to perform a given task.

ML is really about learning to make extrapolations on a given task.

1.0.2 Neural Network (NN)

Neural Network (NN)

Subtype of ML model inspired by the brain. Composed of several interconnected layers of nodes capable of processing and passing information.

Common representation of a neural network[1]

1.0.3 Deep Learning (DL)

Deep Learning (DL)

Subcategory of Machine Learning that uses large NN models (i.e. with many layers) to solve complex problems.

1.1 Categories of ML

1.1.1 Type of dataset

The four main categories of ML are based on the type of dataset used to train the model:

               Supervised                   Unsupervised   Semi-supervised     Reinforcement
Input          Data                         Data           Data                Environment
Ground-truth   Yes                          No             Partial             No (reward)
Examples       Classification, Regression   Clustering     Anomaly detection   Game playing
  • Supervised: for each input in the dataset, the expected output is also part of the dataset
  • Unsupervised: for each input in the dataset, the expected output is not part of the dataset
  • Semi-supervised: only a portion of the inputs of the dataset have their expected output in the dataset
  • Reinforcement: there is no predefined dataset, but an environment giving feedback to the model when it takes actions

1.1.2 Type of output

Another way to categorize ML models is based on the type of output they produce:

Category Description Example Outputs Example Use Cases
Classification Assign one (or multiple) label(s) chosen from a given list of classes to each element of the input. “Cat”, “Dog”, “Bird” Spam detection, Image recognition
Regression Assign one (or multiple) value(s) chosen from a continuous set of values. 3.5, 7.2, 15.8 Stock price prediction, Age estimation
Clustering Create categories by grouping together similar inputs. Cluster 1, Cluster 2 Customer segmentation, Image compression
Anomaly Detection Detect outliers in the dataset. Normal, Outlier Fraud detection, Fault detection
Generative Models Generate new data similar to the training data. Image, Text, Audio Image generation, Text completion
Ranking Arrange items in order of relevance or importance. Rank 1, Rank 2, Rank 3 Search engine, Recommendation system
Reinforcement Learning Learn a policy to maximize long-term rewards through interaction with an environment. Policy, Action sequence Game playing, Robotics control
Dimensionality Reduction Reduce the number of features while retaining meaningful information. 2D or 3D projection Visualization, Data compression

1.2 Dataset

1.2.1 Dataset - Definition

Dataset

A collection of data used to train, validate and test ML models.

1.2.2 Dataset - Example

Dataset example

sepal length sepal width petal length petal width species
0 5.7 3.8 1.7 0.3 0
1 6.3 3.3 6.0 2.5 2
2 6.9 3.2 5.7 2.3 2
... ... ... ... ... ...
147 4.8 3.4 1.9 0.2 0
148 6.4 2.7 5.3 1.9 2
149 4.6 3.4 1.4 0.3 0

150 rows × 5 columns
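This table corresponds to the classic Iris dataset, which ships with scikit-learn; a minimal sketch of loading it into a pandas DataFrame:

```python
# Load the Iris dataset into a pandas DataFrame
# (assumes scikit-learn and pandas are installed)
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target  # label column: 0, 1 or 2

print(df.shape)  # (150, 5)
```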

1.2.3 Content - Definitions

Instance (or sample)

An instance is one individual entry of the dataset (a row).

Feature (or attribute or variable)

A feature is a piece of information that the model uses to make predictions.

Label (or target or output or class)

A label is a piece of information that the model is trying to predict.

Feature vs. Label

Features and labels are simply different columns in the dataset with different roles.

1.2.4 Content - Example

Instances, features and labels

Feature 1 Feature 2 Feature 3 Feature 4 Label
Instance 0 5.7 3.8 1.7 0.3 0
Instance 1 6.3 3.3 6.0 2.5 2
Instance 2 6.9 3.2 5.7 2.3 2
... ... ... ... ... ...
Instance 147 4.8 3.4 1.9 0.2 0
Instance 148 6.4 2.7 5.3 1.9 2
Instance 149 4.6 3.4 1.4 0.3 0

150 rows × 5 columns

1.2.5 Subset

Dataset subsets

A ML dataset is usually subdivided into three disjoint subsets, each with a distinct role in the training process:

  • Training set: used during training to train the model,
  • Validation set: used during training to assess the generalization capability of the model, tune hyperparameters and prevent overfitting,
  • Test set: used after training to evaluate the performance of the model on new data it has not encountered before.

Metaphor of studies: exercises, past years exams and real exam
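As an illustration, a common way to obtain the three subsets (here a 60/20/20 split, an arbitrary choice) is to apply scikit-learn's train_test_split twice:

```python
# Split a toy dataset into train / validation / test subsets (60/20/20)
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 toy instances, 2 features
y = np.arange(50) % 2              # toy binary labels

# First split off 40%, then cut that 40% in half
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```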

2 Overview of ML methods

2.1 Supervised Learning

2.1.1 Decision Trees

A tree-like structure used for both classification and regression

2.1.2 Decision Trees - Example

2.1.3 Random Forests

An ensemble method that combines multiple decision trees:

  • Train \(B\) trees independently using:
    • Bagging: each tree is fitted on a random subset of the training set
    • Feature bagging: each split in the decision tree (i.e. each node) is chosen among a subset of the features
  • Take a decision by aggregating individual decisions of each tree
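A minimal sketch with scikit-learn's RandomForestClassifier on the Iris dataset (the parameter values are illustrative):

```python
# Random Forest: B trees trained with bagging and feature bagging,
# individual predictions aggregated by majority vote (done internally).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # B, the number of trees
    max_features="sqrt",  # feature bagging: features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy on the test set
```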

2.1.4 Random Forests - Example

2.1.5 Support Vector Machines (SVM)

Used for classification and regression, effective in high-dimensional spaces:

  • Separate the feature space using optimal hyperplanes
  • Features are mapped into a higher-dimensional space, allowing a separation that is non-linear in the original feature space

SVM: map features to a higher dimensional space to be able to separate classes using a hyperplane[2]

The kernel trick is a way to map the features in a higher dimensional space without actually computing the new features.
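A minimal sketch: an SVM with an RBF kernel separating toy 1D data that is not linearly separable in the original space (the data and parameters are illustrative):

```python
# RBF-kernel SVM: the kernel trick separates classes that no
# hyperplane could separate in the original 1D feature space.
import numpy as np
from sklearn.svm import SVC

# Toy data: class 1 inside (-1, 1), class 0 outside
X = np.linspace(-3, 3, 61).reshape(-1, 1)
y = (np.abs(X.ravel()) < 1).astype(int)

clf = SVC(kernel="rbf")  # kernel="linear" would fail on this data
clf.fit(X, y)
print(clf.score(X, y))
```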

2.1.6 Support Vector Machines (SVM) - Example

2.1.7 Other methods

Other methods for supervised learning include:

Method Description
Linear Regression Predicts a continuous value with a linear model
Logistic Regression Predicts a binary value with a linear model
K-Nearest Neighbors (KNN) Non-parametric method for classification and regression
Boosting Ensemble method (like Random Forests) that combines weak learners to form a strong model
Naive Bayes Probabilistic classifier based on Bayes’ theorem

2.2 Unsupervised Learning

2.2.1 K-Means Clustering

A method for partitioning data into \(k\) clusters:

  • \(k\) must be chosen a priori
  • The principle is to start with random centroids and then iteratively:
    1. Assign each point to its closest centroid
    2. Move each centroid to the mean of the points assigned to it
  • Classical method with many variants

\(k\)-Means Clustering convergence process[3]
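The iteration above is handled internally by scikit-learn's KMeans; a minimal sketch on toy data:

```python
# k-Means: k must be chosen a priori; centroids are then moved
# iteratively until convergence.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two toy blobs of points around (0, 0) and (5, 5)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # close to (0, 0) and (5, 5)
```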

2.2.2 Hierarchical Clustering

Builds a hierarchy of clusters using either agglomerative or divisive methods:

  • Build a full hierarchy top-down (divisive) or bottom-up (agglomerative)
  • Create any number of clusters by cutting the tree

Hierarchical Clustering[4]

2.2.3 Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Clustering based on the density of data points:

  • Divides points into four categories: core points (red in the illustration below), directly reachable (yellow), reachable, and outliers (blue)
  • Only two parameters: the radius (\(\epsilon\)) and the minimum number of neighbors for a point to be core (\(min_{pts}\))

DBSCAN Illustration[5]

DBSCAN Result[6]
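A minimal sketch with scikit-learn's DBSCAN on toy data (the eps and min_samples values are illustrative):

```python
# DBSCAN: only two parameters, eps (radius) and min_samples
# (number of neighbors required for a point to be a core point).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, (50, 2)),  # dense blob -> one cluster
    rng.normal(4, 0.3, (50, 2)),  # second dense blob
    [[10.0, 10.0]],               # isolated point -> labelled -1 (outlier)
])

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
print(set(db.labels_))  # e.g. two cluster labels plus -1 for noise
```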

2.2.4 Other methods

Other methods for unsupervised learning include:

Method Description
K-Means Partition data into \(k\) clusters
Hierarchical Clustering Build a hierarchy of clusters
DBSCAN Density-based clustering that groups points closely packed together
Gaussian Mixture Models (GMM) Probabilistic clustering assuming data is generated from multiple Gaussian distributions
Principal Component Analysis (PCA) Reduce dimensionality by finding principal components that explain variance
t-SNE Nonlinear dimensionality reduction for visualizing high-dimensional data
Autoencoders Neural networks that learn efficient representations of data in an unsupervised manner
Self-Organizing Maps (SOM) Neural network-based method for clustering and visualization

2.3 Dimensionality Reduction

2.3.1 Principal Component Analysis (PCA)

Dimensionality reduction technique to project data into lower dimensions:

  • Project data into a space of lower dimension
  • Keep as much variance (so as much information) as possible

PCA Illustration[7]
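A minimal sketch with scikit-learn's PCA on toy 3D data whose variance lies mostly along one direction:

```python
# PCA: project data onto the directions of maximum variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 3D data: most variance along one direction, little elsewhere
t = rng.normal(0, 3, 200)
X = np.column_stack([t, 0.5 * t + rng.normal(0, 0.1, 200), rng.normal(0, 0.1, 200)])

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # first component dominates
```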

2.3.2 t-Distributed Stochastic Neighbor Embedding (t-SNE)

A nonlinear dimensionality reduction technique primarily used for visualization of high-dimensional data

t-SNE Result on MNIST dataset[8]

3 Neural Networks

3.0.1 Main points

  • Neural networks are a type of ML model inspired by the brain
  • A NN is composed of layers of nodes (or neurons)
  • Each node takes inputs, sums them up with weights and applies an activation function to produce an output
  • There are many kinds of layers, the main ones being:
    • Dense (or fully connected)
    • Convolutional
    • Recurrent
    • Pooling
  • What defines a NN is:
    • Its architecture, i.e. the succession of layers and their parameters
    • Its weights, i.e. the values used in the weighted sums
  • The weights are learned during the training process, which consists in:
    • Forward pass: compute the output of the model
    • Backward pass: compute the gradient of the loss function with respect to the weights
    • Optimization: update the weights using the gradient
    • Repeat until the model performs well (on a validation set)
    • Evaluate the model on a test set

4 Usual pipeline

4.0.1 Overview

  1. Data acquisition
  2. Data preprocessing
  3. Model selection
  4. Model evaluation
  5. Final model training

4.0.2 Data acquisition

Gather the data, potentially from multiple different sources. Choosing the right sources can also depend on the choices made in the next steps.

4.1 Data preprocessing

4.1.1 Different issues

Multiple sources of issues and steps to perform:

  1. Handle different formats
  2. Remove outliers (mostly for raw data)
  3. (Optionally) extract features
  4. Handle missing data
  5. Normalize

4.1.2 Why normalization?

Idea

A priori, all features are equally important, so none of them should have an advantage. Features with larger values than the others would therefore distort the training.

Usually, all features are individually normalized over the whole dataset, to obtain a distribution with an average of 0 and a standard deviation of 1:

\[ \begin{align*} \hat{X} & = \frac{1}{n+1} \sum\limits_{j=0}^n X_j \\ \sigma_X & = \sqrt{\frac{1}{n+1} \sum\limits_{j=0}^n (X_j - \hat{X})^2} \\ \forall k \in [0, \cdots, n], \ X_k & \leftarrow \frac{X_k - \hat{X}}{\sigma_X} \end{align*} \]
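A minimal NumPy sketch of this standardization (scikit-learn's StandardScaler implements the same operation):

```python
# Standardization: zero mean and unit standard deviation per feature.
import numpy as np

# Two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / std

print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # [1, 1]
```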

4.1.3 Model selection

  • Type of model (ML, NN, DL, …)
  • Complexity:
    • Number of features
    • Type of output
    • Size of the layers (for NN)
    • Number of layers (for NN)
  • Hyperparameters

4.1.4 Model optimization/training

  • Loss selection: depends on the task, the objectives, the specific issues to solve
  • Training process selection (lots of different tweaks and improvements can be implemented in NN training)
  • Hyperparameter tuning, by repeatedly:
    • Selecting one or multiple configurations of hyperparameters
    • Training the model one or multiple times
    • Determining the best hyperparameters

4.1.5 Model evaluation - Criteria

Criteria selection among the many possible ones:

  • For classification:
    • Accuracy: for balanced datasets
    • Precision: when false positives are costly
    • Recall: when false negatives are costly
    • F1-Score: when class distribution is unbalanced
  • For regression:
    • Mean Absolute Error (MAE)
    • Mean Square Error (MSE): more sensitive to large errors than MAE
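These criteria are all available in scikit-learn; a minimal sketch on toy predictions:

```python
# Classification and regression metrics with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Classification: toy ground truth vs. predictions
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(accuracy_score(y_true, y_pred))   # fraction of correct predictions
print(precision_score(y_true, y_pred))  # penalizes false positives
print(recall_score(y_true, y_pred))     # penalizes false negatives
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall

# Regression: toy continuous values
y_true_r = [3.0, 5.0, 2.0]
y_pred_r = [2.5, 5.0, 4.0]
print(mean_absolute_error(y_true_r, y_pred_r))
print(mean_squared_error(y_true_r, y_pred_r))  # more sensitive to large errors
```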

4.1.6 Model evaluation - Cross-validation

Cross-validation

Method to estimate the real performance of the model by:

  1. Splitting the dataset in multiple parts (usually 5)
  2. For different combinations of these parts (usually 5), training and evaluating the model

Cross-validation[9]
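A minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score (the model choice is illustrative):

```python
# 5-fold cross-validation: train and evaluate the model on 5 different
# train/validation splits of the dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)         # one accuracy per fold
print(scores.mean())  # estimate of the real performance
```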

4.1.7 Final model training

Once the data is preprocessed, the model selected, and the hyperparameters tuned, the final model can be trained multiple times and the best run kept.

5 Challenges

5.1 Data

5.1.1 Quality

Quality of the data is obviously crucial to train well-performing models. Quality encompasses multiple aspects:

  • Raw data quality: the input data must contain enough detail for the task to be achievable. Be careful however: more features imply larger models, which take longer and are harder to train.
  • Annotations quality: the annotations must be as precise and correct as possible in the context of the task at hand. Every blunder or outlier in a supervised dataset will slow down training and might result in unexpected behaviors of the trained model.

5.1.2 Diversity

Diversity is the most important aspect of a dataset because ML models are great at generalizing but bad at guessing in new scenarios. There are different aspects to diversity to keep in mind:

  • A well-defined task is crucial to identify all the various cases that we want our model to handle. Being as exhaustive as possible when selecting the training instances will accelerate training and considerably improve the model.
  • Balancing the dataset can also improve the training. When training on imbalanced datasets (i.e. when some cases are much more represented than others), the model will focus on the most represented situations as it will be the easiest and quickest way to get better results. There are ways of correcting this phenomenon, but it is always better to avoid it if possible when building the dataset.

5.1.3 Biases and fairness

Biased

Refers to a model which always makes the same kind of wrong predictions in similar cases.

In practice, a model trained on biased data will usually reproduce those biases. This can have major consequences and shouldn't be underestimated: even a cold-hearted ML algorithm is not objective if it wasn't trained on objectively chosen and annotated data.

However, there exist model architectures, training and evaluation methods to prevent and detect biases, which can sometimes allow building unbiased models from biased data. But this needs to be well thought out and won't happen unless it is deliberately designed for.

5.2 Underfitting and Overfitting

5.2.1 Definitions

Underfitting

When a model is too simple to properly extract information from a complex task. Can also be explained by key information missing in the input features.

Overfitting

When a model is too complex to properly generalize to new data. Happens often when a NN is trained too long on a dataset that is not diverse enough and learns the noise in the data.

5.2.2 Illustrations

Underfitting and overfitting on a regression task[10]

Underfitting and overfitting on a classification task[10]

5.2.3 Solutions

Solution Underfitting Overfitting
Complexity Increase Reduce
Number of features Increase Reduce
Regularization Reduce Increase
Training time Increase Reduce

General strategies:

  • Cross-validation to identify problems
  • Grid search/random search to tune hyperparameters and balance between underfitting and overfitting
  • Ensemble methods to reduce overfitting by using many smaller models instead of one big one
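A minimal sketch of a grid search over model complexity with scikit-learn's GridSearchCV (the parameter grid is illustrative):

```python
# Grid search with cross-validation to balance under- and overfitting.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth controls complexity: too small underfits, too large overfits
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 5, 10, None]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)  # depth with the best cross-validated score
```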

5.3 Interpretable and Explainable

5.3.1 Definitions

Interpretable

Qualifies a ML model whose decision-making process is straightforward and transparent, making it directly understandable by humans. This requires restricting the model's complexity.

Explainable

Qualifies a ML model whose decision-making process can be partly interpreted after the fact using post hoc interpretation techniques. These techniques are typically applied to models too complex to be directly interpretable.

6 Python libraries

6.0.1 Data manipulation

  • NumPy
    • Fast numerical operations
    • Matrices with any number of dimensions (called arrays)
    • Lots of convenient operators on arrays
  • Pandas
    • Can store any type of data
    • 1D or 2D tables (called DataFrames)
    • Lots of convenient operators on DataFrames

6.0.2 ML

  • SciPy
    • Scientific and technical computing based on NumPy
    • Lots of tools for optimization, integration, interpolation, etc.
  • scikit-learn
    • Simple and efficient tools for data mining and data analysis
    • Built on NumPy, SciPy and matplotlib

6.0.3 Visualization

  • Matplotlib
    • 2D plotting library (3D also possible)
    • Can create plots, histograms, power spectra, bar charts, error charts, scatterplots, etc.
    • Gallery
  • Plotly
    • Similar to Matplotlib but with interactive plots
    • Gallery
  • Seaborn
    • Data visualization library based on Matplotlib and Pandas
    • Very powerful for statistical data visualization
    • Gallery

7 Resources

7.1 References

[1] User:Wiso. “Neural network example.” Available at: https://commons.wikimedia.org/w/index.php?curid=5084582.
[2] Alisneaky (vector: Zirguezi). “SVM kernel trick: Map features to a higher dimensional space to be able to separate classes using a hyperplane.” Available at: https://commons.wikimedia.org/w/index.php?curid=47868867.
[3] Chire. “K-means clustering convergence process.” Available at: https://commons.wikimedia.org/w/index.php?curid=59409335.
[4] Stathis Sideris (derivative work: Mhbrugman). “Hierarchical clustering.” Available at: https://commons.wikimedia.org/w/index.php?curid=7344806.
[5] Chire. “DBSCAN illustration.” Available at: https://commons.wikimedia.org/w/index.php?curid=17045963.
[6] Chire. “DBSCAN result.” Available at: https://commons.wikimedia.org/w/index.php?curid=17085332.
[7] Nicoguaro. “PCA illustration.” Available at: https://commons.wikimedia.org/w/index.php?curid=46871195.
[8] Kyle McDonald. “T-SNE result on MNIST dataset.” Available at: https://commons.wikimedia.org/w/index.php?curid=115726949.
[9] Gufosowa. “Cross-validation.” Available at: https://commons.wikimedia.org/w/index.php?curid=82298768.
[10] GeeksforGeeks. “Underfitting and overfitting in machine learning.” Available at: https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/.